Optimized text for full unicode and some escape sequences (#129169)#7
Optimized text for full unicode and some escape sequences (#129169)#7MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
Conversation
…9169) Follow-up to elastic#126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to elastic#129072.
📝 WalkthroughWalkthroughThe JSON string parsing in ESUTF8StreamJsonParser is refactored from linear while-loop processing to a state machine-based approach using switch statements. This tracks escape sequences and string length separately, reconstructing output by removing escape characters when needed. Tests are rewritten with randomized input generation to validate the new parsing behavior. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java`:
- Around line 52-57: getValueAsText() currently returns a raw slice from
_inputBuffer when stringEnd > 0 which preserves escape sequences, causing
inconsistency after _finishAndReturnText() rebuilds de-escaped bytes; modify
getValueAsText() to detect when the token contains backslashes (or when
_finishAndReturnText() previously rebuilt bytes) and return the cached
de-escaped bytes instead of the original slice: call or reuse
_finishAndReturnText() (or store its result in a field like cachedTextBytes)
when escapes are present, and in the stringEnd fast path return a Text backed by
those de-escaped bytes (use identifiers _finishAndReturnText, cached buffer
field, stringEnd, _inputBuffer, _inputPtr, stringLength, _tokenIncomplete to
locate code).
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.javalibs/x-content/impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java
🔇 Additional comments (7)
libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java (2)
24-31: State tracking additions look good.The added length tracking and backslash bookkeeping are appropriate for the optimized path.
74-138: Switch-based parsing and buffer rebuild logic look solid.The state machine and the split between direct-slice and rebuilt-buffer paths are clear and maintain the optimized intent.
libs/x-content/impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java (5)
23-23: LGTM.The Locale import is used by the new escape formatting.
40-95: Test updates for optimizedTextpath are clear and thorough.The assertions now explicitly validate the Text-based path across escaped and unescaped cases.
134-145: Good addition of input/result modeling for escapes.The
TestInputrecord andESCAPE_SEQUENCEStable make test intent explicit.
146-191: Random input generator is well-structured.The helper methods capture both optimized and fallback paths cleanly.
194-231: Randomized test flow aligns with new optimized behavior.The branch logic matches the expected contract for
getValueAsText()vsgetValueAsString().
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| public Text getValueAsText() throws IOException { | ||
| if (_currToken == JsonToken.VALUE_STRING && _tokenIncomplete) { | ||
| if (stringEnd > 0) { | ||
| final int len = stringEnd - 1 - _inputPtr; | ||
| // For now, we can use `len` for `stringLength` because we only support ascii-encoded unescaped strings, | ||
| // which means each character uses exactly 1 byte. | ||
| return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), len); | ||
| return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength); | ||
| } |
There was a problem hiding this comment.
Ensure repeated getValueAsText() calls return de‑escaped bytes.
When backslashes are present, _finishAndReturnText() returns a rebuilt buffer without escapes, but subsequent calls hit the stringEnd fast path and return a slice of the original input (still containing backslashes). That makes repeated calls inconsistent.
Consider caching the rebuilt bytes (or rebuilding on demand when backslashes exist) and using them in the stringEnd path.
🔧 One way to make repeated calls consistent
@@
- private final List<Integer> backslashes = new ArrayList<>();
+ private final List<Integer> backslashes = new ArrayList<>();
+ private byte[] stringBytes;
@@
- if (stringEnd > 0) {
- final int len = stringEnd - 1 - _inputPtr;
- return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);
- }
+ if (stringEnd > 0) {
+ if (stringBytes != null) {
+ return new Text(new XContentString.UTF8Bytes(stringBytes), stringLength);
+ }
+ final int len = stringEnd - 1 - _inputPtr;
+ return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);
+ }
@@
- stringLength = 0;
- backslashes.clear();
+ stringLength = 0;
+ backslashes.clear();
+ stringBytes = null;
@@
- return new Text(new XContentString.UTF8Bytes(buff), stringLength);
+ stringBytes = buff;
+ return new Text(new XContentString.UTF8Bytes(buff), stringLength);🤖 Prompt for AI Agents
In
`@libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java`
around lines 52 - 57, getValueAsText() currently returns a raw slice from
_inputBuffer when stringEnd > 0 which preserves escape sequences, causing
inconsistency after _finishAndReturnText() rebuilds de-escaped bytes; modify
getValueAsText() to detect when the token contains backslashes (or when
_finishAndReturnText() previously rebuilt bytes) and return the cached
de-escaped bytes instead of the original slice: call or reuse
_finishAndReturnText() (or store its result in a field like cachedTextBytes)
when escapes are present, and in the stringEnd fast path return a Text backed by
those de-escaped bytes (use identifiers _finishAndReturnText, cached buffer
field, stringEnd, _inputBuffer, _inputPtr, stringLength, _tokenIncomplete to
locate code).
PR_017
Summary by CodeRabbit
Bug Fixes
Tests